{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "DataAnalysis"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "from nltk import word_tokenize\n",
    "from tools import *\n",
    "%matplotlib inline"
   ]
  },
  {
   "source": [
    "# Ch5 分类和标注词汇\n",
    "\n",
    "1.  什么是词汇分类，在自然语言处理中它们如何使用？\n",
    "2.  对于存储词汇和它们的分类来说什么是好的 Python 数据结构？\n",
    "3.  如何自动标注文本中每个词汇的词类？\n",
    "\n",
    "-   词性标注（parts-of-speech tagging，POS tagging）：简称标注。将词汇按照它们的词性（parts-of-speech，POS）进行分类并对它们进行标注\n",
    "-   词性：也称为词类或者词汇范畴。\n",
    "-   标记集：用于特定任务标记的集合。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## Sec 5.1 使用词性标注器\n",
    "\n",
    "启发标注器(part-of-speech tagger, POS tagger)：处理一个词序列，为每个词附加一个词性标记。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "text = word_tokenize(\"And now for something completely different\")\n",
    "nltk.pos_tag(text)\n",
    "nltk.help.upenn_tagset('CC')\n",
    "nltk.help.upenn_tagset('RB')\n",
    "nltk.help.upenn_tagset('IN')\n",
    "nltk.help.upenn_tagset('NN')\n",
    "nltk.help.upenn_tagset('JJ')"
   ],
   "cell_type": "code",
   "metadata": {},
   "execution_count": 2,
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "CC: conjunction, coordinating\n    & 'n and both but either et for less minus neither nor or plus so\n    therefore times v. versus vs. whether yet\nRB: adverb\n    occasionally unabatingly maddeningly adventurously professedly\n    stirringly prominently technologically magisterially predominately\n    swiftly fiscally pitilessly ...\nIN: preposition or conjunction, subordinating\n    astride among uppon whether out inside pro despite on by throughout\n    below within for towards near behind atop around if like until below\n    next into if beside ...\nNN: noun, common, singular or mass\n    common-carrier cabbage knuckle-duster Casino afghan shed thermostat\n    investment slide humour falloff slick wind hyena override subhumanity\n    machinist ...\nJJ: adjective or numeral, ordinal\n    third ill-mannered pre-war regrettable oiled calamitous first separable\n    ectoplasmic battery-powered participatory fourth still-to-be-named\n    multilingual multi-disciplinary ...\n"
     ]
    }
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n\n"
     ]
    }
   ],
   "source": [
    "print(nltk.corpus.brown.readme())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "Project Gutenberg Selections\nhttp://gutenberg.net/\n\nThis corpus contains etexts from from Project Gutenberg,\nby the following authors:\n\n* Jane Austen (3)\n* William Blake (2)\n* Thornton W. Burgess\n* Sarah Cone Bryant\n* Lewis Carroll\n* G. K. Chesterton (3)\n* Maria Edgeworth\n* King James Bible\n* Herman Melville\n* John Milton\n* William Shakespeare (3)\n* Walt Whitman\n\nThe beginning of the body of each book could not be identified automatically,\nso the semi-generic header of each file has been removed, and included below.\nSome source files ended with a line \"End of The Project Gutenberg Etext...\",\nand this has been deleted.\n\nInformation about Project Gutenberg (one page)\n\nWe produce about two million dollars for each hour we work.  The\nfifty hours is one conservative estimate for how long it we take\nto get any etext selected, entered, proofread, edited, copyright\nsearched and analyzed, the copyright letters written, etc.  This\nprojected audience is one hundred million readers.  If our value\nper text is nominally estimated at one dollar, then we produce 2\nmillion dollars per hour this year we, will have to do four text\nfiles per month:  thus upping our productivity from one million.\nThe Goal of Project Gutenberg is to Give Away One Trillion Etext\nFiles by the December 31, 2001.  [10,000 x 100,000,000=Trillion]\nThis is ten thousand titles each to one hundred million readers,\nwhich is 10% of the expected number of computer users by the end\nof the year 2001.\n\nWe need your donations more than ever!\n\nAll donations should be made to \"Project Gutenberg/IBC\", and are\ntax deductible to the extent allowable by law (\"IBC\" is Illinois\nBenedictine College).  (Subscriptions to our paper newsletter go\nto IBC, too)\n\nFor these and other matters, please mail to:\n\nProject Gutenberg\nP. O. Box  2782\nChampaign, IL 61825\n\nWhen all other email fails try our Michael S. Hart, Executive\nDirector:\nhart@vmd.cso.uiuc.edu (internet)   hart@uiucvmd   (bitnet)\n\nWe would prefer to send you this information by email\n(Internet, Bitnet, Compuserve, ATTMAIL or MCImail).\n\n******\nIf you have an FTP program (or emulator), please\nFTP directly to the Project Gutenberg archives:\n[Mac users, do NOT point and click. . .type]\n\nftp mrcnext.cso.uiuc.edu\nlogin:  anonymous\npassword:  your@login\ncd etext/etext91\nor cd etext92\nor cd etext93 [for new books]  [now also in cd etext/etext93]\nor cd etext/articles [get suggest gut for more information]\ndir [to see files]\nget or mget [to get files. . .set bin for zip files]\nget INDEX100.GUT\nget INDEX200.GUT\nfor a list of books\nand\nget NEW.GUT for general information\nand\nmget GUT* for newsletters.\n\n**Information prepared by the Project Gutenberg legal advisor**\n(Three Pages)\n\n\n***START**THE SMALL PRINT!**FOR PUBLIC DOMAIN ETEXTS**START***\nWhy is this \"Small Print!\" statement here?  You know: lawyers.\nThey tell us you might sue us if there is something wrong with\nyour copy of this etext, even if you got it for free from\nsomeone other than us, and even if what's wrong is not our\nfault.  So, among other things, this \"Small Print!\" statement\ndisclaims most of our liability to you.  It also tells you how\nyou can distribute copies of this etext if you want to.\n\n*BEFORE!* YOU USE OR READ THIS ETEXT\nBy using or reading any part of this PROJECT GUTENBERG-tm\netext, you indicate that you understand, agree to and accept\nthis \"Small Print!\" statement.  If you do not, you can receive\na refund of the money (if any) you paid for this etext by\nsending a request within 30 days of receiving it to the person\nyou got it from.  If you received this etext on a physical\nmedium (such as a disk), you must return it with your request.\n\nABOUT PROJECT GUTENBERG-TM ETEXTS\nThis PROJECT GUTENBERG-tm etext, like most PROJECT GUTENBERG-\ntm etexts, is a \"public domain\" work distributed by Professor\nMichael S. Hart through the Project Gutenberg Association at\nIllinois Benedictine College (the \"Project\").  Among other\nthings, this means that no one owns a United States copyright\non or for this work, so the Project (and you!) can copy and\ndistribute it in the United States without permission and\nwithout paying copyright royalties.  Special rules, set forth\nbelow, apply if you wish to copy and distribute this etext\nunder the Project's \"PROJECT GUTENBERG\" trademark.\n\nTo create these etexts, the Project expends considerable\nefforts to identify, transcribe and proofread public domain\nworks.  Despite these efforts, the Project's etexts and any\nmedium they may be on may contain \"Defects\".  Among other\nthings, Defects may take the form of incomplete, inaccurate or\ncorrupt data, transcription errors, a copyright or other\nintellectual property infringement, a defective or damaged\ndisk or other etext medium, a computer virus, or computer\ncodes that damage or cannot be read by your equipment.\n\nLIMITED WARRANTY; DISCLAIMER OF DAMAGES\nBut for the \"Right of Replacement or Refund\" described below,\n[1] the Project (and any other party you may receive this\netext from as a PROJECT GUTENBERG-tm etext) disclaims all\nliability to you for damages, costs and expenses, including\nlegal fees, and [2] YOU HAVE NO REMEDIES FOR NEGLIGENCE OR\nUNDER STRICT LIABILITY, OR FOR BREACH OF WARRANTY OR CONTRACT,\nINCLUDING BUT NOT LIMITED TO INDIRECT, CONSEQUENTIAL, PUNITIVE\nOR INCIDENTAL DAMAGES, EVEN IF YOU GIVE NOTICE OF THE\nPOSSIBILITY OF SUCH DAMAGES.\n\nIf you discover a Defect in this etext within 90 days of\nreceiving it, you can receive a refund of the money (if any)\nyou paid for it by sending an explanatory note within that\ntime to the person you received it from.  If you received it\non a physical medium, you must return it with your note, and\nsuch person may choose to alternatively give you a replacement\ncopy.  If you received it electronically, such person may\nchoose to alternatively give you a second opportunity to\nreceive it electronically.\n\nTHIS ETEXT IS OTHERWISE PROVIDED TO YOU \"AS-IS\".  NO OTHER\nWARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, ARE MADE TO YOU AS\nTO THE ETEXT OR ANY MEDIUM IT MAY BE ON, INCLUDING BUT NOT\nLIMITED TO WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A\nPARTICULAR PURPOSE.\n\nSome states do not allow disclaimers of implied warranties or\nthe exclusion or limitation of consequential damages, so the\nabove disclaimers and exclusions may not apply to you, and you\nmay have other legal rights.\n\nINDEMNITY\nYou will indemnify and hold the Project, its directors,\nofficers, members and agents harmless from all liability, cost\nand expense, including legal fees, that arise directly or\nindirectly from any of the following that you do or cause:\n[1] distribution of this etext, [2] alteration, modification,\nor addition to the etext, or [3] any Defect.\n\nDISTRIBUTION UNDER \"PROJECT GUTENBERG-tm\"\nYou may distribute copies of this etext electronically, or by\ndisk, book or any other medium if you either delete this\n\"Small Print!\" and all other references to Project Gutenberg,\nor:\n\n[1]  Only give exact copies of it.  Among other things, this\n     requires that you do not remove, alter or modify the\n     etext or this \"small print!\" statement.  You may however,\n     if you wish, distribute this etext in machine readable\n     binary, compressed, mark-up, or proprietary form,\n     including any form resulting from conversion by word pro-\n     cessing or hypertext software, but only so long as\n     *EITHER*:\n\n     [*]  The etext, when displayed, is clearly readable, and\n          does *not* contain characters other than those\n          intended by the author of the work, although tilde\n          (~), asterisk (*) and underline (_) characters may\n          be used to convey punctuation intended by the\n          author, and additional characters may be used to\n          indicate hypertext links; OR\n\n     [*]  The etext may be readily converted by the reader at\n          no expense into plain ASCII, EBCDIC or equivalent\n          form by the program that displays the etext (as is\n          the case, for instance, with most word processors);\n          OR\n\n     [*]  You provide, or agree to also provide on request at\n          no additional cost, fee or expense, a copy of the\n          etext in its original plain ASCII form (or in EBCDIC\n          or other equivalent proprietary form).\n\n[2]  Honor the etext refund and replacement provisions of this\n     \"Small Print!\" statement.\n\n[3]  Pay a trademark license fee to the Project of 20% of the\n     net profits you derive calculated using the method you\n     already use to calculate your applicable taxes.  If you\n     don't derive profits, no royalty is due.  Royalties are\n     payable to \"Project Gutenberg Association / Illinois\n     Benedictine College\" within the 60 days following each\n     date you prepare (or were legally required to prepare)\n     your annual (or equivalent periodic) tax return.\n\nWHAT IF YOU *WANT* TO SEND MONEY EVEN IF YOU DON'T HAVE TO?\nThe Project gratefully accepts contributions in money, time,\nscanning machines, OCR software, public domain etexts, royalty\nfree copyright licenses, and every other sort of contribution\nyou can think of.  Money should be paid to \"Project Gutenberg\nAssociation / Illinois Benedictine College\".\n\nThis \"Small Print!\" by Charles B. Kramer, Attorney\nInternet (72600.2026@compuserve.com); TEL: (212-254-5093)\n*END*THE SMALL PRINT! FOR PUBLIC DOMAIN ETEXTS*Ver.04.29.93*END*\n\n"
     ]
    }
   ],
   "source": [
    "print(nltk.corpus.gutenberg.readme())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "[('They', 'PRP'),\n",
       " ('refuse', 'VBP'),\n",
       " ('to', 'TO'),\n",
       " ('permit', 'VB'),\n",
       " ('us', 'PRP'),\n",
       " ('to', 'TO'),\n",
       " ('obtain', 'VB'),\n",
       " ('the', 'DT'),\n",
       " ('refuse', 'NN'),\n",
       " ('permit', 'NN')]"
      ]
     },
     "metadata": {},
     "execution_count": 5
    }
   ],
   "source": [
    "# 处理同形同音异义词，系统正确标注了\n",
    "# 前面的refUSE是动词，后面的REFuse是名词\n",
    "# 前面的permit是动词，后面的permit是名字\n",
    "text = word_tokenize(\"They refuse to permit us to obtain the refuse permit\")\n",
    "nltk.pos_tag(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "[('They', 'PRP'),\n",
       " ('refuse', 'VBP'),\n",
       " ('to', 'TO'),\n",
       " ('permit', 'VB'),\n",
       " ('us', 'PRP'),\n",
       " ('to', 'TO'),\n",
       " ('obtain', 'VB'),\n",
       " ('the', 'DT'),\n",
       " ('beautiful', 'JJ'),\n",
       " ('book', 'NN')]"
      ]
     },
     "metadata": {},
     "execution_count": 6
    }
   ],
   "source": [
    "text = word_tokenize(\"They refuse to permit us to obtain the beautiful book\")\n",
    "nltk.pos_tag(text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "--------------- >text.similar('word')< ---------------\n",
      "time man way day year place one house car city state family room war\n",
      "problem work question country school church\n",
      "--------------- >text.similar('woman')< ---------------\n",
      "man time day year car moment world house family child country boy\n",
      "state job place way war girl work word\n",
      "--------------- >text.similar('bought')< ---------------\n",
      "made said done put had seen found given left heard was been brought\n",
      "set got that took in told felt\n",
      "--------------- >text.similar('over')< ---------------\n",
      "in on to of and for with from at by that into as up out down through\n",
      "is all about\n",
      "--------------- >text.similar('the')< ---------------\n",
      "a his this their its her an that our any all one these my in your no\n",
      "some other and\n"
     ]
    }
   ],
   "source": [
    "# 找出形如w1 w w2的上下文，然后再找出所有出现在相同上下文的词 w'，即w1 w' w2\n",
    "# 用于寻找相似的单词，因为这些单词处于相同的上下文中\n",
    "text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())\n",
    "show_subtitle(\"text.similar('word')\")\n",
    "text.similar('word')\n",
    "show_subtitle(\"text.similar('woman')\")\n",
    "text.similar('woman')\n",
    "show_subtitle(\"text.similar('bought')\")\n",
    "text.similar('bought')\n",
    "show_subtitle(\"text.similar('over')\")\n",
    "text.similar('over')\n",
    "show_subtitle(\"text.similar('the')\")\n",
    "text.similar('the')"
   ]
  }
 ]
}